Exploratory Analysis (Univariate Analysis) and Linear Regression

Load Packages & Read in the Cleaned Data Set

library(tidyverse) 
data <- read.csv("world_clean.csv")

First Model: Prediction of Birth Rate (crude_birth_rate_perK)

We are trying to predict the birth rate in a country using socioeconomic features and controlling by region. Main variables we are using are:

  • Years in school (years_in_school)
    • Level of education is a proxy to understand if women were able to receive sexual education.
  • Gross domestic product (gdp)
    • The income of a region can help us to predict if the population has good medical coverage and access to information of family planning.
  • Use of a contraceptive method (any_method)
    • Contraceptive methods hinder unplanned childbirth.
  • Mean age of childbearing (mean_age_of_childbearing)
    • The sooner a woman starts having children will increase the expected children that can have. (Also, it is important to keep in mind that this variable can have an endogenous bias.)
  • Region (region_l)
    • Childbirth might change by region regardless of other socioeconomic variables.
  • Score (score)
    • We have an estimated score based on abortion legality.

Preliminary Analyses

We are going to perform two types of analyses:

  • Linear association with the outcome
    • We will measure the association of each variable by itself with the outcome we want to predict and also analyze the residual.
  • Correlation Matrix
    • We want to see how our variables correlate between themselves.
Association of birth rate with years of school

We can appreciate a statistically significant negative association, which is bolstering our hypothesis. Also, the residual errors do not show a pattern that might alert us of any bias.

# fit the model
lm_birth_rate_predicted_by_years_in_school <- 
  lm(crude_birth_rate_perK ~  years_in_school , data = data) 
# get the coefficient table
summary(lm_birth_rate_predicted_by_years_in_school)$coeff
##                  Estimate Std. Error   t value     Pr(>|t|)
## (Intercept)     40.687403  1.2139104  33.51763 2.522863e-77
## years_in_school -2.231353  0.1273078 -17.52724 4.068120e-40
# make a residual plot
plot(lm_birth_rate_predicted_by_years_in_school$residuals, 
     pch = 16, col = "red", xlab = "Observations", ylab = "Residuals")

Association of birth rate with wealth of the nation (gdp per capita)

We also can appreciate a statistically significant negative association, which is bolstering our hypothesis that Women with a better income have access to better medical attention and family planning.But also, the effect is very small compared with education (-3.e-4 vs -2, i.e. 10,000 smaller). Additionally, the residual errors do not show a pattern that might alert us of any bias.

# fit the model
lm_birth_rate_predicted_by_gdp <- 
  lm(crude_birth_rate_perK ~  gdp , data = data) 
# get the coefficient table
summary(lm_birth_rate_predicted_by_gdp)$coeff
##                  Estimate   Std. Error   t value     Pr(>|t|)
## (Intercept) 24.5893416583 7.440081e-01 33.049830 8.881464e-80
## gdp         -0.0003102258 3.329204e-05 -9.318316 3.443326e-17
# make a residual plot
plot(lm_birth_rate_predicted_by_gdp$residuals, 
     pch = 16, col = "violet", xlab = "Observations", ylab = "Residuals")

Association of birth rate with the use of contraceptive methods

We also can appreciate a statistically significant negative association and see that the effect is higher than gdp but lower than education and could be related with the fact that the use of a contraceptive method does not guarantee unplanned children if it is not used in a right way but a higher level of education will imply an effective use (let’s see how those 2 variables correlate in the correlation matrix). Also, the residual errors do not show a pattern that might alert us of any bias.

# fit the model
lm_birth_rate_predicted_by_any_method <- 
  lm(crude_birth_rate_perK ~  any_method , data = data) 
# get the coefficient table
summary(lm_birth_rate_predicted_by_any_method)$coeff
##               Estimate Std. Error   t value     Pr(>|t|)
## (Intercept) 38.6168928 1.35037556  28.59715 3.034824e-66
## any_method  -0.3329505 0.02408666 -13.82302 1.836519e-29
# make a residual plot
plot(lm_birth_rate_predicted_by_any_method$residuals, 
     pch = 16, col = "green", xlab = "Observations", ylab = "Residuals")

Association of birth rate with mean age of childbearing

We see a negative association, which makes sense as the older a woman start having children, it is less probable that she will have many kids, but the association is not as robust as the previous variables. Also, the residuals seem to be loaded to low values (it is not uniform as it should be). Let’s use this variable with precautions because it can have endogenity.

# fit the model
lm_birth_rate_predicted_by_mean_age_of_childbearing <-
  lm(crude_birth_rate_perK ~  mean_age_of_childbearing , data = data) 
# get the coefficient table
summary(lm_birth_rate_predicted_by_mean_age_of_childbearing)$coeff
##                            Estimate Std. Error   t value    Pr(>|t|)
## (Intercept)              42.0714324 13.1807533  3.191884 0.001641682
## mean_age_of_childbearing -0.7492439  0.4518191 -1.658283 0.098827863
# make a residual plot
plot(lm_birth_rate_predicted_by_mean_age_of_childbearing$residuals, 
     pch = 16, col = "red", xlab = "Observations", ylab = "Residuals")

Association of birth rate with abortion legality score

The abortion legality score is statistically positive. Notwithstanding, the residuals do not look uniform so we should use this variable with caution.

# fit the model
lm_birth_rate_predicted_by_score <- 
  lm(crude_birth_rate_perK ~  score , data = data) 
# get the coefficient table
summary(lm_birth_rate_predicted_by_score)$coeff
##              Estimate Std. Error   t value     Pr(>|t|)
## (Intercept) 15.876226  0.9608522 16.523068 3.382000e-38
## score        2.152454  0.3140607  6.853622 1.051584e-10
# make a residual plot
plot(lm_birth_rate_predicted_by_score$residuals, 
     pch = 16, col = "brown", xlab = "Observations", ylab = "Residuals")

Association of birth rate with region

Regions have different effects on child birth, but for some cases the effect is not statistically significant.

# fit the model
lm_birth_rate_predicted_by_regionL <- 
  lm(crude_birth_rate_perK ~  region_l , data = data) 
# get the coefficient table
summary(lm_birth_rate_predicted_by_regionL)$coeff
##                                           Estimate Std. Error    t value
## (Intercept)                              12.757000   3.918524  3.2555628
## region_lCentral and Southern Asia         9.242714   4.189078  2.2063839
## region_lEastern and South-eastern Asia    2.850368   4.119603  0.6919037
## region_lEurope and Northern America      -2.474833   4.010737 -0.6170520
## region_lLatin America and the Caribbean   3.647316   4.020320  0.9072202
## region_lNorthern Africa and Western Asia  6.876040   4.072249  1.6885116
## region_lOther Oceania                    10.693000   4.232489  2.5264092
## region_lSub-saharan Africa               19.876120   3.996126  4.9738475
##                                              Pr(>|t|)
## (Intercept)                              1.335460e-03
## region_lCentral and Southern Asia        2.853080e-02
## region_lEastern and South-eastern Asia   4.898250e-01
## region_lEurope and Northern America      5.379239e-01
## region_lLatin America and the Caribbean  3.654159e-01
## region_lNorthern Africa and Western Asia 9.291981e-02
## region_lOther Oceania                    1.232246e-02
## region_lSub-saharan Africa               1.441194e-06
# make a residual plot
plot(lm_birth_rate_predicted_by_regionL$residuals, 
     pch = 16, col = "cyan", xlab = "Observations", ylab = "Residuals")

Because of this, it is sensible to use dummy coded variables for those regions where the effect is robust such as Sub-Saharan Africa, Central and Southern Asia, and Other Oceania.

# fit the model
lm_birth_rate_predicted_by_regionL <- 
  lm(crude_birth_rate_perK ~  r_Subsaharan_Africa + r_OtherOceania + r_CS.Asia , 
     data = data) 
# get the coefficient table
summary(lm_birth_rate_predicted_by_regionL)$coeff
##                      Estimate Std. Error   t value     Pr(>|t|)
## (Intercept)         14.826151  0.5489264 27.009359 2.571208e-68
## r_Subsaharan_Africa 17.806969  1.0298772 17.290380 1.978141e-41
## r_OtherOceania       8.623849  1.8615000  4.632742 6.531027e-06
## r_CS.Asia            7.173563  1.7358577  4.132576 5.292316e-05
# make a residual plot
plot(lm_birth_rate_predicted_by_regionL$residuals, 
     pch = 16, col = "purple", xlab = "Observations", ylab = "Residuals")

Matrix correlation
# select the variable we need
linear_model_variables <- data %>% 
  select(mean_age_of_childbearing, any_method, gdp, years_in_school, score)
# write a function to deal with NAs
NA2mean <- function(x) replace(x, is.na(x), mean(x, na.rm = TRUE))
# apply the function to the data frame
replaced_nulls <- replace(linear_model_variables, TRUE, 
                          lapply(linear_model_variables, NA2mean))
# get the correlation matrix
cor(replaced_nulls)
##                          mean_age_of_childbearing any_method        gdp
## mean_age_of_childbearing               1.00000000 -0.1232125  0.3153997
## any_method                            -0.12321247  1.0000000  0.2518059
## gdp                                    0.31539967  0.2518059  1.0000000
## years_in_school                        0.08497914  0.5971134  0.3338028
## score                                 -0.01238248 -0.3394830 -0.2139738
##                          years_in_school       score
## mean_age_of_childbearing      0.08497914 -0.01238248
## any_method                    0.59711345 -0.33948304
## gdp                           0.33380275 -0.21397378
## years_in_school               1.00000000 -0.40091731
## score                        -0.40091731  1.00000000

After estimating the correlation matrix of all numeric variables (we are excluding the regions), the variable any_method is highly correlated with years of education, gdp and score, but let’s see how the features interact in the model.

Using All Features for the Model

# fit the model
lm_birth_rate_all_features <- 
  lm(crude_birth_rate_perK ~ r_Subsaharan_Africa+ r_OtherOceania  + r_CS.Asia +
       mean_age_of_childbearing + any_method + gdp + years_in_school + score, 
     data = data) 
# get the coefficient table
summary(lm_birth_rate_all_features)$coeff
##                               Estimate   Std. Error    t value     Pr(>|t|)
## (Intercept)               4.325162e+01 9.119563e+00  4.7427293 4.967604e-06
## r_Subsaharan_Africa       9.708195e+00 1.111123e+00  8.7372846 5.142354e-15
## r_OtherOceania            7.007703e+00 2.034647e+00  3.4441870 7.478515e-04
## r_CS.Asia                 2.325165e+00 1.400901e+00  1.6597640 9.910836e-02
## mean_age_of_childbearing -4.676312e-01 3.013162e-01 -1.5519614 1.228365e-01
## any_method               -9.882125e-02 2.443069e-02 -4.0449637 8.446670e-05
## gdp                      -1.803369e-05 3.321743e-05 -0.5428986 5.880275e-01
## years_in_school          -8.644171e-01 1.633836e-01 -5.2907206 4.373005e-07
## score                     6.433929e-01 2.090061e-01  3.0783449 2.486984e-03
# make a residual plot
plot(lm_birth_rate_all_features$residuals, pch = 16, col = "pink",
     xlab = "Observations", ylab = "Residuals")

gdp and mean_age_of_childbearing are not statistically significant. Let’s remove them.

# fit the model
lm_birth_rate_all_features <- 
  lm(crude_birth_rate_perK ~ r_Subsaharan_Africa + r_OtherOceania + r_CS.Asia + 
       any_method + years_in_school + score, data = data) 
# get the coefficient table
summary(lm_birth_rate_all_features)$coeff
##                        Estimate Std. Error   t value     Pr(>|t|)
## (Intercept)         29.28682882 1.96290933 14.920113 1.410970e-31
## r_Subsaharan_Africa  9.71185333 1.11341945  8.722547 4.449580e-15
## r_OtherOceania       6.87965089 2.01788841  3.409332 8.340045e-04
## r_CS.Asia            3.18456328 1.36145298  2.339092 2.063241e-02
## any_method          -0.08740959 0.02347684 -3.723227 2.765310e-04
## years_in_school     -0.93599072 0.15806711 -5.921477 2.047722e-08
## score                0.70943639 0.20275970  3.498902 6.130717e-04
# plot the residuals
plot(lm_birth_rate_all_features$residuals, pch = 16, col = "purple",
     xlab = "Observations", ylab = "Residuals")

After dropping gdp and mean_age_of_childbearing, we have all features as statically significant.

Second Model: Prediction of Infant Mortality rate (infant_mortality)

We are trying to predict the infant mortality rate in a country using socioeconomic features and controlling by region. Main variables we are using are:

  • Years in school (years_in_school)
    • Level of education might indicate better knowledge of how a parents have to take care of children.
  • Gross domestic product (gdp)
    • The income of a region can help us to predict if the population has good medical coverage.
  • Use of a contraceptive method (any_method)
    • Contraceptive methods enhance family planning and better attention to children.
  • Mean age of childbearing (mean_age_of_childbearing)
    • The sooner a woman starts having children the least is prepared to be a sensible mother.
  • Region (region_l)
    • Region can consider other socioeconomic variables not reflected on previous ones.
  • Score (score)
    • We have an estimated score based on abortion legality which is correlated with unplanned parenthood.

Preliminary Analyses

We are going to perform two types of analyses:

  • Linear association with the outcome
    • We will measure the association of each variable by itself with the outcome we want to predict and also analyze the residual.
  • Correlation Matrix
    • We already did this analysis for child birth.
Association of infant mortality rate with years of school

The results are robust sustaining the hypothesis that more years of school strongly hinder child mortality and we do not see patterns on the residuals.

# fit the model
lm_infant_mortality_predicted_by_years_in_school <- 
  lm(infant_mortality ~  years_in_school , data = data) 
# get the coefficient table
summary(lm_infant_mortality_predicted_by_years_in_school)$coeff
##                  Estimate Std. Error   t value     Pr(>|t|)
## (Intercept)     59.914704  2.4292821  24.66354 3.301987e-58
## years_in_school -4.354876  0.2556391 -17.03525 1.103044e-38
# make a residual plot
plot(lm_infant_mortality_predicted_by_years_in_school$residuals, 
     pch = 16, col = "red", xlab = "Observations", ylab = "Residuals")

Association of infant mortality rate with wealth of the nation (gdp per capita)

The results are also robust sustaining the hypothesis that a better income hinders child mortality but the effect is very low (-4e-4) and we do not see patterns on the residuals.

# fit the model
lm_infant_mortality_predicted_by_gdp <- 
  lm(infant_mortality ~  gdp , data = data) 
# get the coefficient table
summary(lm_infant_mortality_predicted_by_gdp)$coeff 
##                  Estimate   Std. Error   t value     Pr(>|t|)
## (Intercept) 26.7129193939 1.480034e+00 18.048852 7.994141e-43
## gdp         -0.0004178504 5.784906e-05 -7.223114 1.246609e-11
# make a residual plot
plot(lm_infant_mortality_predicted_by_gdp$residuals, 
     pch = 16, col = "violet", xlab = "Observations", ylab = "Residuals")

Association of infant mortality rate with the use of contraceptive methods

The results are also robust sustaining the hypothesis that using contraceptive methods hinders child mortality because parents will have planned children and we do not see patterns on the residuals.

# fit the model
lm_infant_mortality_predicted_by_any_method <- 
  lm(infant_mortality ~  any_method , data = data) 
# get the correlation table
summary(lm_infant_mortality_predicted_by_any_method)$coeff
##              Estimate Std. Error   t value     Pr(>|t|)
## (Intercept) 54.328728 2.79659552  19.42674 1.703507e-44
## any_method  -0.620588 0.05083354 -12.20824 7.808829e-25
# make a residual plot
plot(lm_infant_mortality_predicted_by_any_method$residuals, 
     pch = 16, col = "green", xlab = "Observations", ylab = "Residuals")

Association of infant mortality rate with mean age of childbearing

We see a negative effect but it is not statistically significant. Also, residuals are negatively loaded (we should not blindly trust on using it).

# fit the model
lm_infant_mortality_predicted_by_mean_age_of_childbearing <- 
  lm(infant_mortality ~  mean_age_of_childbearing , data = data) 
# get the correlation table
summary(lm_infant_mortality_predicted_by_mean_age_of_childbearing)$coeff
##                           Estimate Std. Error   t value   Pr(>|t|)
## (Intercept)              69.696984 27.2926651  2.553689 0.01147853
## mean_age_of_childbearing -1.665865  0.9368675 -1.778122 0.07705340
# make a residual plot
plot(lm_infant_mortality_predicted_by_mean_age_of_childbearing$residuals, 
     pch = 16, col = "red", xlab = "Observations", ylab = "Residuals")

Association of infant mortality rate with abortion legality score

The abortion legality score is statistically positive.

# fit the model
lm_infant_mortality_predicted_by_score <- 
  lm(infant_mortality ~  score , data = data) 
# get the correlation table
summary(lm_infant_mortality_predicted_by_score)$coeff
##              Estimate Std. Error  t value     Pr(>|t|)
## (Intercept) 13.478451  1.9625691 6.867759 8.921706e-11
## score        3.181517  0.6254993 5.086365 8.683561e-07
# make a residual plot
plot(lm_infant_mortality_predicted_by_score$residuals, 
     pch = 16, col = "cyan", xlab = "Observations", ylab = "Residuals")

Association of infant mortality rate with region

Regions have different effects on child mortality, but for some cases the effect is not statistically significant.

# fit the model
lm_infant_mortality_predicted_by_regionL <- 
  lm(infant_mortality ~  region_l , data = data) 
# get the correlation table
summary(lm_infant_mortality_predicted_by_regionL)$coeff
##                                            Estimate Std. Error    t value
## (Intercept)                               3.4950000   8.385827 0.41677464
## region_lCentral and Southern Asia        20.5928571   8.964826 2.29707268
## region_lEastern and South-eastern Asia   12.4575000   8.894513 1.40058264
## region_lEurope and Northern America       0.3640244   8.587924 0.04238794
## region_lLatin America and the Caribbean  11.0162903   8.652110 1.27324903
## region_lNorthern Africa and Western Asia 10.1897826   8.742829 1.16550177
## region_lOther Oceania                    18.8361111   9.270881 2.03174996
## region_lSub-saharan Africa               41.9991667   8.558749 4.90716198
##                                              Pr(>|t|)
## (Intercept)                              6.773511e-01
## region_lCentral and Southern Asia        2.279211e-02
## region_lEastern and South-eastern Asia   1.630995e-01
## region_lEurope and Northern America      9.662375e-01
## region_lLatin America and the Caribbean  2.046088e-01
## region_lNorthern Africa and Western Asia 2.453931e-01
## region_lOther Oceania                    4.368259e-02
## region_lSub-saharan Africa               2.094960e-06
# make a residual plot
plot(lm_infant_mortality_predicted_by_regionL$residuals, 
     pch = 16, col = "purple", xlab = "Observations", ylab = "Residuals")

Because of this, it is sensible to use a dummy for those regions where the effect is robust such as: Sub-Saharan Africa, Central and Southern Asia, and Other Oceania (the same ones we chose for child birth rate but only Sub-Saharan Africa is strongly robust).

# fit the model
lm_infant_mortality_predicted_by_regionL <- 
  lm(infant_mortality ~  r_Subsaharan_Africa + r_OtherOceania + r_CS.Asia, 
     data = data) 
# get the correlation table
summary(lm_infant_mortality_predicted_by_regionL)$coeff
##                     Estimate Std. Error   t value     Pr(>|t|)
## (Intercept)         10.48717   1.170505  8.959524 4.072181e-16
## r_Subsaharan_Africa 35.00700   2.143708 16.330116 2.288702e-37
## r_OtherOceania      11.84394   4.309550  2.748301 6.600702e-03
## r_CS.Asia           13.60069   3.525422  3.857889 1.591671e-04
# make a residual plot
plot(lm_infant_mortality_predicted_by_regionL$residuals, 
     pch = 16, col = "purple", xlab = "Observations", ylab = "Residuals")

Using all features for the model

# fit the model
lm_infant_mortality_all_features <- 
  lm(infant_mortality ~ r_Subsaharan_Africa + r_OtherOceania + r_CS.Asia +
       mean_age_of_childbearing + any_method + gdp + years_in_school + score, 
     data = data) 
# get the correlation table
summary(lm_infant_mortality_all_features)$coeff
##                               Estimate   Std. Error    t value     Pr(>|t|)
## (Intercept)               8.721596e+01 2.028072e+01  4.3004377 3.114477e-05
## r_Subsaharan_Africa       1.752760e+01 2.469567e+00  7.0974399 5.225230e-11
## r_OtherOceania            3.392868e+00 4.535726e+00  0.7480320 4.556524e-01
## r_CS.Asia                 2.839746e+00 3.113812e+00  0.9119837 3.632910e-01
## mean_age_of_childbearing -1.447077e+00 6.697831e-01 -2.1605164 3.237556e-02
## any_method               -2.545828e-01 5.429228e-02 -4.6891172 6.274800e-06
## gdp                       3.769402e-05 7.397471e-05  0.5095528 6.111394e-01
## years_in_school          -1.918339e+00 3.652613e-01 -5.2519619 5.261817e-07
## score                     5.857769e-01 4.690655e-01  1.2488167 2.137446e-01
# make a residual plot
plot(lm_infant_mortality_all_features$residuals, 
     pch = 16, col = "purple", xlab = "Observations", ylab = "Residuals")

gdp is not statistically significant. Let’s remove it first before removing the other ones that also are not significant.

# fit the model
lm_infant_mortality_selected_features_1 <- 
  lm(infant_mortality ~ r_Subsaharan_Africa + r_OtherOceania  + r_CS.Asia +
       mean_age_of_childbearing + any_method + years_in_school + score, 
     data = data) 
# get the correlation table
summary(lm_infant_mortality_selected_features_1)$coeff
##                            Estimate  Std. Error    t value     Pr(>|t|)
## (Intercept)              82.2512359 16.72203773  4.9187328 2.277793e-06
## r_Subsaharan_Africa      17.2922968  2.47633464  6.9830210 8.910427e-11
## r_OtherOceania            3.3460145  4.50658733  0.7424719 4.589704e-01
## r_CS.Asia                 2.9124843  3.13504016  0.9290102 3.543864e-01
## mean_age_of_childbearing -1.3188009  0.54553363 -2.4174512 1.683752e-02
## any_method               -0.2324605  0.05319905 -4.3696365 2.320696e-05
## years_in_school          -1.8681163  0.35528540 -5.2580723 4.964988e-07
## score                     0.6034146  0.45632607  1.3223322 1.880838e-01
# make a residual plot
plot(lm_infant_mortality_selected_features_1$residuals, 
     pch = 16, col = "purple", xlab = "Observations", ylab = "Residuals")

After dropping gdp, dummy regions of Oceania and Central & South Asia are not statistically significant. Let’s drop them.

# fit the model
lm_infant_mortality_selected_features_2 <- 
  lm(infant_mortality ~ r_Subsaharan_Africa +  mean_age_of_childbearing + 
       any_method + years_in_school + score, data = data) 
# get the correlation table
summary(lm_infant_mortality_selected_features_2)$coeff
##                            Estimate  Std. Error   t value     Pr(>|t|)
## (Intercept)              86.6859785 15.63956153  5.542737 1.294874e-07
## r_Subsaharan_Africa      16.2808837  2.30026215  7.077838 5.142701e-11
## mean_age_of_childbearing -1.4176880  0.52308599 -2.710239 7.501791e-03
## any_method               -0.2514928  0.05009273 -5.020545 1.434175e-06
## years_in_school          -1.8663711  0.34485164 -5.412099 2.396220e-07
## score                     0.6286242  0.43781274  1.435829 1.531194e-01
# make a residual plot
plot(lm_infant_mortality_selected_features_2$residuals, 
     pch = 16, col = "purple", xlab = "Observations", ylab = "Residuals")

After dropping Oceania and Central & South Asia, abortion legality score remains as non-relevant. Let’s remove it from the last model.

# fit the model
lm_infant_mortality_selected_features_3 <- 
  lm(infant_mortality ~ r_Subsaharan_Africa + mean_age_of_childbearing + 
       any_method + years_in_school, data = data) 
# get the correlation table
summary(lm_infant_mortality_selected_features_3)$coeff
##                            Estimate  Std. Error   t value     Pr(>|t|)
## (Intercept)              90.0329231 15.51878221  5.801546 3.690442e-08
## r_Subsaharan_Africa      15.6359604  2.26384699  6.906810 1.268724e-10
## mean_age_of_childbearing -1.4087263  0.52487213 -2.683942 8.082807e-03
## any_method               -0.2603678  0.04988321 -5.219547 5.797571e-07
## years_in_school          -2.0469299  0.32222530 -6.352480 2.332289e-09
# make a residual plot
plot(lm_infant_mortality_selected_features_3$residuals, 
     pch = 16, col = "purple", xlab = "Observations", ylab = "Residuals")

After dropping abortion law score, years in school, mean age of childbearing and use of contraceptive methods reduce infant mortality and being in a country from Sub-Saharan Africa increases it drastically.

Third Model: Prediction of HIV prevalence

We are trying to predict the HIV prevalence in a country using socioeconomic features and controlling by region. Main variables we are using are:

  • Years in school (years_in_school)
    • Level of education might indicate better knowledge of how how to prevent STDs (including HIV).
  • Gross domestic product (gdp)
    • The income of a region can help us to predict has access to condoms and information.
  • Use of a contraceptive method (any_method)
    • Contraceptive methods imply better sexual health practices.
  • Barrier (barrier)
    • Barrier method has a high risk reduction of contracting HIV.
  • Region (region_l)
    • Region can consider other socioeconomic variables not reflected on previous ones.
  • Score (score)
    • We have an estimated score based on abortion legality which is correlated with unplanned parenthood.

Preliminary Analyses

We are going to perform two types of analyses:

  • Linear association with the outcome
    • We will measure the association of each variable by itself with the outcome we want to predict and also analyze the residual.
  • Correlation Matrix
    • Let’s see how variables correlate by themselves.

Note: As HIV prevalence is not reported by all countries, let’s see how many countries have a null value

sum(is.na(data$HIV_prev))
## [1] 77

77 Countries did not report HIV prevalence. Given this, let’s use only countries reporting it and be aware that reporting might play as a bias.

data_2 <- data %>% filter(HIV_prev > -1) 
Association of outcome with years of School

Years in School does not seem to impact HIV prevalence by itself. Also, residuals are highly loaded to a negative value (let’s be aware of using it to predict HIV prevalence).

# fit the model
lm_HIV_prev_predicted_by_years_in_school <- 
  lm(HIV_prev ~ years_in_school, data = data) 
# get the correlation table
summary(lm_HIV_prev_predicted_by_years_in_school)$coeff
##                  Estimate Std. Error   t value     Pr(>|t|)
## (Intercept)      3.163907 0.88478812  3.575892 0.0004728503
## years_in_school -0.150590 0.09449427 -1.593642 0.1131642202
# make a residual plot
plot(lm_HIV_prev_predicted_by_years_in_school$residuals, 
     pch = 16, col = "red", xlab = "Observations", ylab = "Residuals")

Association of outcome with wealth of the nation (gdp per capita)

GDP shows a negative impact but the impact is not robust. Also, we see that residuals are far from being uniform.

# fit the model
lm_HIV_prev_predicted_by_gdp <- lm(HIV_prev ~ gdp , data = data_2) 
# get the correlation table
summary(lm_HIV_prev_predicted_by_gdp)$coeff
##                  Estimate   Std. Error   t value     Pr(>|t|)
## (Intercept)  2.435442e+00 4.094515e-01  5.948060 1.835189e-08
## gdp         -4.441732e-05 1.779406e-05 -2.496188 1.363523e-02
# make a residual plot
plot(lm_HIV_prev_predicted_by_gdp$residuals, 
     pch = 16, col = "violet", xlab = "Observations", ylab = "Residuals")

Association of outcome with the use of contraceptive methods

Any method not seem to impact HIV prevalence by itself. Also, residuals are highly loaded to a negative value. It seems that the HIV prevalence might have an error.

# fit the model
lm_HIV_prev_predicted_by_any_method <- lm(HIV_prev ~ any_method , data = data_2)
# get the correlation table
summary(lm_HIV_prev_predicted_by_any_method)$coeff
##                Estimate Std. Error    t value    Pr(>|t|)
## (Intercept)  2.51657153 0.93688942  2.6860924 0.008103997
## any_method  -0.01066428 0.01643212 -0.6489899 0.517408125
# make a residual plot
plot(lm_HIV_prev_predicted_by_any_method$residuals, 
     pch = 16, col = "green", xlab = "Observations", ylab = "Residuals")

Association of outcome with barrier

Residuals remain as not uniform and the effect is not robust.

# fit the model
lm_HIV_prev_predicted_by_barrier <- lm(HIV_prev ~ barrier , data = data_2) 
# get the correlation table
summary(lm_HIV_prev_predicted_by_barrier)$coeff
##              Estimate Std. Error   t value     Pr(>|t|)
## (Intercept) 1.5648087 0.45700411 3.4240582 0.0007957091
## barrier     0.0326779 0.03384156 0.9656143 0.3357914411
# make a residual plot
plot(lm_HIV_prev_predicted_by_barrier$residuals, 
     pch = 16, col = "red", xlab = "Observations", ylab = "Residuals")

Association of outcome with abortion legality score

Residuals remain as not uniform and the effect is not robust.

# fit the model
lm_HIV_prev_predicted_by_score <- lm(HIV_prev ~ score, data = data_2) 
# get the correlation table
summary(lm_HIV_prev_predicted_by_score)$coeff
##               Estimate Std. Error   t value    Pr(>|t|)
## (Intercept) 1.70694645  0.4866471 3.5075655 0.000594908
## score       0.06230606  0.1674310 0.3721298 0.710314483
# make a residual plot
plot(lm_HIV_prev_predicted_by_score$residuals, 
     pch = 16, col = "purple", xlab = "Observations", ylab = "Residuals")

Association of outcome with region

Not a single region explains HIV prevalence.

# fit the model
lm_HIV_prev_predicted_by_regionL <- lm(HIV_prev ~ region_l, data = data_2) 
# get the correlation table
summary(lm_HIV_prev_predicted_by_regionL)$coeff
##                                              Estimate Std. Error      t value
## (Intercept)                               0.150000000   2.582546  0.058082218
## region_lCentral and Southern Asia         0.038461538   2.774099  0.013864513
## region_lEastern and South-eastern Asia    0.180769231   2.774099  0.065163213
## region_lEurope and Northern America       0.171081081   2.651426  0.064524177
## region_lLatin America and the Caribbean   0.664814815   2.676487  0.248390799
## region_lNorthern Africa and Western Asia -0.004285714   2.760858 -0.001552313
## region_lOther Oceania                     0.250000000   3.652272  0.068450550
## region_lSub-saharan Africa                5.047826087   2.638091  1.913438963
##                                            Pr(>|t|)
## (Intercept)                              0.95376256
## region_lCentral and Southern Asia        0.98895700
## region_lEastern and South-eastern Asia   0.94813319
## region_lEurope and Northern America      0.94864112
## region_lLatin America and the Caribbean  0.80418109
## region_lNorthern Africa and Western Asia 0.99876355
## region_lOther Oceania                    0.94552064
## region_lSub-saharan Africa               0.05764911
# make a residual plot
plot(lm_HIV_prev_predicted_by_regionL$residuals, 
     pch = 16, col = "cyan", xlab = "Observations", ylab = "Residuals")

Using all features for the model Red flag

We startred from the full model:

# fit the model
lm_HIV_prev_all_features <- lm(HIV_prev ~ barrier + any_method + gdp +
                                 years_in_school + score, data = data) 
# get the correlation table
summary(lm_HIV_prev_all_features)$coeff
##                      Estimate   Std. Error    t value   Pr(>|t|)
## (Intercept)      3.539335e+00 1.403389e+00  2.5219905 0.01286716
## barrier          1.049583e-01 5.016233e-02  2.0923738 0.03833489
## any_method       5.929340e-03 2.469437e-02  0.2401090 0.81062106
## gdp             -5.695014e-05 2.721358e-05 -2.0927106 0.03830426
## years_in_school -2.370405e-01 1.753000e-01 -1.3521987 0.17864153
## score           -1.184903e-01 2.203071e-01 -0.5378413 0.59159920
# make a residual plot
plot(lm_HIV_prev_all_features$residuals, 
     pch = 16, col = "purple", xlab = "Observations", ylab = "Residuals")

any_method and score are not significant in the full model so we removed them.

# fit the model
lm_HIV_prev_all_features <- 
  lm(HIV_prev ~ barrier + gdp + years_in_school, data = data) 
# get the correlation table
summary(lm_HIV_prev_all_features)$coeff
##                      Estimate   Std. Error   t value    Pr(>|t|)
## (Intercept)      3.114311e+00 9.400400e-01  3.312956 0.001171199
## barrier          9.685712e-02 4.151207e-02  2.333228 0.021040042
## gdp             -4.703231e-05 2.426076e-05 -1.938617 0.054531568
## years_in_school -1.770056e-01 1.321507e-01 -1.339422 0.182573208
# make a residual plot
plot(lm_HIV_prev_all_features$residuals, 
     pch = 16, col = "purple", xlab = "Observations", ylab = "Residuals")

yeaers_in_school is not significant so we removed it.

# fit the model
lm_HIV_prev_all_features <- lm(HIV_prev ~ barrier + gdp, data = data) 
# get the correlation table
summary(lm_HIV_prev_all_features)$coeff
##                  Estimate  Std. Error   t value     Pr(>|t|)
## (Intercept)  2.026605e+00 0.475240271  4.264379 3.569819e-05
## barrier      6.827073e-02 0.035445320  1.926086 5.602366e-02
## gdp         -6.293099e-05 0.000020753 -3.032381 2.868845e-03
# make a residual plot
plot(lm_HIV_prev_all_features$residuals, 
     pch = 16, col = "purple", xlab = "Observations", ylab = "Residuals")

Not a single variable explains HIV prevalence which is counter-intuitive and is a strong signal that the data might be wrong.